The UK Data Service

Who we are

  • Five partner universities

    • UK Data Archive, University of Essex (lead partner)
    • Cathie Marsh Institute, University of Manchester
    • Jisc, University of Manchester
    • EDINA, University of Edinburgh
    • University College London
  • 90+ staff

  • Since 2012 (UKDA \(\rightarrow\) 1967); curates national data since 2003

What we do

  • The main single point of access for UK social science data
  • Secondary data collection, curation and access
  • Training and user support
  • Communication and user engagement
  • Impact
  • … Key part of the UK social science research infrastructure, funded by the UKRI/ESRC

Our data…

  • UK social survey microdata:

    • Cross-sectional: large government and academic surveys
    • Longitudinal: major studies following people over time
  • International data: survey data, aggregate databases

  • Census tables and individual data – current and historical

  • Business microdata and administrative data

  • Qualitative data: multimedia files and interview transcripts

Our training

  1. Webinars and online workshops
  2. User Conferences: four main user conferences each year
  3. Drop-in sessions: Survey, Computational Social Science and SecureLab
  4. Online learning materials: find key resources on our Learning Hub
  5. Helpdesk for individual data queries
  6. Check out our YouTube channel

Test

  • Data management planning: sharing, archiving and preserving data
  • Anonymisation and licence/access frameworks
  • Documentation and metadata standards
  • Ethical and legal considerations
  • Data discovery
  • How to access data
  • Safe Researcher Training for the most confidential data
  • Introductions to datasets:
    • Datasets e.g. Census, LFS
    • Data types e.g. Surveys
    • Substantive themes e.g. Crime
  • Evaluating data for research
  • Preparing data/data handling
  • Basic data analysis skills
  • Basic software/tools
  • Teaching with data
  • User conferences
  • Computational Social Science
  • Reproducible research

1. Why a webinar series on data linkage?

Changing data landscape

  • Once upon a time: census and surveys (lots of them!)
  • New kinds of data emerging over the past 20 years
    1. Administrative data
    2. Digital trace data (e.g. social media, web…)
    3. Smart data
  • 2 and 3: genuinely new data
  • 1: Digitalisation of records \(\longrightarrow\) greater availability
  • Increased demand for personal insights: the Monitored Self
  • Potential for new research avenues at a lower cost…

Growing role of non-survey data

  • Why interest is growing
    • New and previously unavailable measurements
    • Large scale, high frequency, low marginal cost
    • Particularly in social epidemiology and socio-economic research
    • Appeal of ‘harder’ types of data
  • But…
    • Collected for administrative purposes, not research
    • Selective coverage and population exclusions
    • Measurement error, changing definitions, policy artefacts
    • Limited socio-demographic and subjective information

Surveys are changing…

  • Rising costs under tighter budgets
  • Increasing fieldwork complexity
  • Recruitment challenges (reach, refusals, panel fatigue)
  • Alleviation comes at a cost: larger samples, incentives
  • Growth of online and mixed-mode designs

… But they remain essential:

  • Only source for attitudes, beliefs, motivations, well-being
  • Rich socio-demographic information not captured elsewhere: detailed occupation, social class, ethnicity
  • Theory-driven measurements and validated instruments (e.g. GHQ)
  • Only tested tool for population-representative data, hard-to-reach groups and subgroup analysis

Increasing number of actors

  • Large data producers: CLS; ISER; ONS; government departments
  • Linkage intermediaries: LLC; ADR; SAIL; SDR
  • Data curator: UKDS
  • Other consortia: CLOSER
  • Government departments; health and education providers
  • \(\longrightarrow\) not always easy for researchers and analysts to navigate the landscape

Skills needs

  • Blurring boundary between computational methods and survey research
  • Existing basic skills
    • Individual and aggregate merging
  • Managing new types of data
    • Web/social media scraping via API
    • Data cleaning/processing
    • Probabilistic matching
    • Inference with non-random samples
  • Navigating the increasingly complex regulatory framework

Conclusion

  • Budgetary pressures on survey data
  • Wealth of cheaper, but narrowly focused, often unrepresentative new forms of data
  • Increasingly complex data provision landscape: greater opacity for non-experts
  • Need to adapt the skills training/capacity building
  • Data integration enables validation and enhancement of both kinds of data (Benzeval 2020)
  • Linkage is still a limited (but growing) practice, and few linked datasets are available for secondary research

2. Exploring integrated data

Working definition

  • Combining different sources of data, i.e.:
    • Survey data \(\leftrightsquigarrow\) survey data
    • Survey data \(\leftrightsquigarrow\) non survey data
    • Non survey data \(\leftrightsquigarrow\) non survey data
  • That include a shared unit of observation (individual, household, area…)
  • … In a coherent way in order to:
    • Validate or
    • … enhance the original data
  • Bidirectional

Validation (Whiffen et al 2020)

  • Effectiveness of population health surveys for estimating prevalence of chronic conditions
  • Reliability of survey-based prevalence estimates for chronic diseases is unclear
  • Data linkage to validate prevalence of selected chronic conditions:
    • Angina, myocardial infarction, heart failure, and asthma.
  • Link 11,323 adults from the 2013 and 2014 Welsh Health Survey to clinical data
  • Secure Anonymised Information Linkage (SAIL) Databank
  • Results depend on the condition:
    • Less agreement for cardiovascular, better for asthma
    • Potentially cheaper
    • Difficulty
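How well survey self-reports agree with linked clinical records can be summarised with a few standard statistics. The sketch below is illustrative only and is not the method used in the study above: the data are made up, the function name `agreement` is hypothetical, and treating the clinical record as the reference is itself an assumption.

```python
def agreement(survey_flags, clinical_flags):
    """Compare self-reported diagnoses with clinical records (1 = has condition).

    The clinical record is treated as the reference, which is an assumption.
    """
    pairs = list(zip(survey_flags, clinical_flags))
    n = len(pairs)
    tp = sum(1 for s, c in pairs if s == 1 and c == 1)
    tn = sum(1 for s, c in pairs if s == 0 and c == 0)
    fp = sum(1 for s, c in pairs if s == 1 and c == 0)
    fn = sum(1 for s, c in pairs if s == 0 and c == 1)
    observed = (tp + tn) / n
    # Agreement expected by chance, computed from the marginal totals
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "kappa": (observed - expected) / (1 - expected),  # Cohen's kappa
    }


# Hypothetical linked records for ten respondents
survey_asthma = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
clinical_asthma = [1, 1, 0, 0, 1, 0, 0, 1, 0, 0]

result = agreement(survey_asthma, clinical_asthma)
print({name: round(value, 2) for name, value in result.items()})
```

Running the same function per condition would make a statement such as "less agreement for cardiovascular, better for asthma" directly comparable across conditions.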

Enhancing surveys with social media data

  • Increasing usage of linked survey and social media data
  • Typical example: asking survey respondents to have their SM behaviour tracked
  • Potentially reduce the cost of the survey
  • Subject to consent: representativeness issues
  • Success depends on the app, gender, etc.
  • See the DIGISURVOR project for an example of current research

Kinds of data linked to survey data

Administrative data

  • Usually arising from the interaction between:

    • An organisation (typically public body)…
    • … the unit for which records are produced (i.e. people)
  • Birth or death certification, mortgage or benefits application, filing tax returns, going to the GP or the A&E

  • Examples:

    • Registry data: birth, death, marriage records
    • Health records, educational transcripts
    • Government records: benefits, earnings/income
    • Financial reports
  • In the UK: Digital Economy Act 2017:

    “… de-identified data from government service providers, excluding NHS data, as part of their day-to-day functions, may be shared for public good research”

Hospital episodes data

  • NHS data about all hospital admissions in England.
  • Four datasets:
    • Episodes of care in: Accident and Emergency, Admitted Patient Care, Adult Critical Care, Outpatients
    • Mostly available for 2007/9-2023
  • Data on diagnosis, maternity, mortality, mental health, length of treatment, deprivation etc.
  • Available for the NCDS Birth Cohort

School inspection data

  • OFSTED ‘State of the nation’: anonymised data on the latest inspection outcomes of 22,000 open schools

  • Linked with the MCS, currently covers years 2005 to 2019

  • Data on a wide range of topics, e.g.:

    • Quality of teaching, learning and assessment
    • Effectiveness of leadership and management
    • Pupils’ achievement (aggregated) (2005-2015)
    • Behaviour and safety of pupils (2005-2015)

NEST pension data

  • (National Employment Savings Trust)

  • Covers 1,000,000 employers, 11 million employees

  • Linked to consenting Understanding Society Wave 11 respondents (about 12,000)

  • Data about:

    • Employer and employee characteristics
    • Current pension status
    • Pension contributions characteristics

Smart data

  • Fuzzier definition
    • Measured data
    • Not traditionally associated with social research
    • Flow/real time data
  • Personal devices:
    • Accelerometer
    • Geolocation
    • Photographic
  • Energy, financial, shopping habits
  • Smart Data Research

Bio measurement

  • Blood sample
  • Epigenetic data
  • Cortisol

Which survey data are most commonly linked?

Linked survey data: in theory

  • Depends on:

    • The topic covered by the data linked i.e. does it match common topics studied in surveys?
    • The survey itself (i.e. does it include the required linking information / user consent)
    • … Scope of the surveys i.e. is linkage part of the original data collection, or is it a subsequent project?
    • Means available from the data producer

… And in practice

  • Major longitudinal studies:

    • Birth cohort studies
    • Next Steps and ELSA
    • Understanding Society
  • A few large scale cross-sectional surveys such as:

    • ASHE (Annual Survey of Hours and Earnings)
    • Family Resources Survey
    • Scottish Health Survey (project)

Birth cohort studies

  • Follow a sample of individuals over their whole life
  • Born during a specific period: 1958 (NCDS), 1970 (BCS), 2000 (MCS), 2026 (Generation New Era)
  • Millennium Cohort Study (MCS)
    • ~ 19,000 children (born between June 2001 and Jan 03)
    • 7 ‘sweeps’: 9 months, then at 3, 5, 7, 11, 14 and 17 years old
    • Parent and child interviews
    • Focuses on education, skills and health, truancy, cognitive ability, biological measurements
    • … Traditional socio-economic and demographic data

Other cohort studies

  • Next Steps

    • AKA Longitudinal Study of Young People in England
    • 16,000 people in England born 1989-90, from secondary school age (i.e. 13-14) onwards
    • Set up by DfE to study determinants of school outcomes
  • ELSA (English Longitudinal Study of Ageing):

    • Follows a sample of 19,000 people aged over 50 to understand all aspects of ageing in England.
    • Started in 2002, biennial waves.
    • Data on physical and mental health (incl. well-being), financial circumstances, and attitudes about ageing.

Understanding Society (1)

  • Largest longitudinal study of the UK population

  • Initial sample size: 40K households, 100K individuals

  • 14 waves so far: 2009-23. Includes BHPS data 1991-2009

  • Ethnic minority boost samples, innovation panel

  • Very wide range of topics covered:

    • Employment, income, benefits, savings, debt, and assets
    • Health, well-being, and health behaviours
    • Housing, housing costs, and dwelling characteristics

Understanding Society (2)

  • Further topics:

    • Family, partnerships, caring responsibilities,
    • Education, training
    • Expenditure, consumption, deprivation
    • Social attitudes, values, political opinions
    • Transport, mobility, and commuting patterns
    • Environmental behaviours, and related attitudes

Integrated datasets curated by UKDS

Next Steps: Student Loans Data

  • Data on higher education loans for Next Steps participants

    • who provided consent to linkage in the age 25 sweep.
  • Information about:

    • Full Next Steps dataset +
    • applications for student finance,
    • payment transactions posted to participants’ accounts,
    • repayment details and
    • overseas assessment details.

    Hospitalisation Episodes Data (SN8681)

MCS: National Pupil Database

  • Data for children in England whose carer gave consent

  • Data from National Pupil Database and the Pupil Level Annual School Census.

    • Pupil level school census data from N1 to year 11 (2016/17)
    • KS1, KS2, KS4 and KS5 results (Years 2, 6, 11, 12 and 13)
    • Absence data from year 1 to year 11
    • School characteristics and school changes: N1 to year 11
    • Anonymised School identifiers (URN) and anonymised Local Education Authorities (LEA)
  • Also available for Next Steps and Understanding Society

    Ofsted Reports (SN9436)

Vacancy Survey (SN7421)

  • Statutory, monthly survey of ~6,000 GB businesses
  • Single question:
    • How many job vacancies does the business have for which it is actively seeking recruits from outside the organisation?
  • Sample drawn from the Inter-Departmental Business Register (IDBR, compiled from HMRC VAT and PAYE registers)
  • Number of vacancies
    • via linkage: SIC code (industrial activity classification), number of employees
  • Data available from 2005 to 2025
  • Potential for additional linkage via IDBR

3. Overview of Data integration techniques and skills needs

Computational social science

  • Web scraping (Python/R)
  • API queries (X, Reddit…)
  • Pattern detection, e.g. random forests
  • Data cleaning (Pandas, Tidyverse)
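Cleaning usually comes before any matching: free-text fields such as names and postcodes need a consistent form in both sources. The following is a minimal sketch using only the Python standard library; the helper names `clean_name` and `clean_postcode` are hypothetical, and the postcode helper only reformats and does not validate.

```python
import re
import unicodedata


def clean_name(raw):
    """Lower-case, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z\s]", "", text.lower())
    return " ".join(text.split())


def clean_postcode(raw):
    """Upper-case a UK-style postcode with one space before the last
    three characters. No validity check is attempted."""
    compact = re.sub(r"\s+", "", raw).upper()
    return f"{compact[:-3]} {compact[-3:]}" if len(compact) > 3 else compact


print(clean_name("  O'Connor,  Siân "))  # oconnor sian
print(clean_postcode("co4  3sq"))        # CO4 3SQ
```

The same steps can be expressed with Pandas or the Tidyverse; the standard library is used here only to keep the example self-contained.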

Data matching

  • Traditional merging:
    • The simplest case: individual-level data matched to individual-level data on an unambiguous shared identifier
    • The same holds at the aggregate level (for example, matching smart sensor data at small-area level)
  • Deterministic vs probabilistic matching
  • When the sources have separate IDs
  • When the data are not clean
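The difference between the two approaches can be sketched on toy data. The deterministic step is an exact join on a shared identifier. The second step is a deliberately simplified score-based match (exact date of birth as a blocking variable, then string similarity on names), standing in for a full probabilistic model such as Fellegi-Sunter, which it does not implement. All records, field names and the 0.85 threshold are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical toy records: a survey file and an administrative file
survey = [
    {"pid": 1, "name": "Sarah O'Connor", "dob": "1984-03-12"},
    {"pid": 2, "name": "Mohammed Khan", "dob": "1990-11-02"},
    {"pid": 3, "name": "Jon Smith", "dob": "1975-07-30"},
]
admin = [
    {"pid": 1, "name": "Sara OConnor", "dob": "1984-03-12", "benefit": "yes"},
    {"pid": 2, "name": "Mohamed Khan", "dob": "1990-11-02", "benefit": "no"},
    {"pid": 9, "name": "John Smith", "dob": "1975-07-30", "benefit": "no"},
]


def deterministic_match(left, right, key):
    """Exact join on a shared, unambiguous identifier."""
    index = {row[key]: row for row in right}
    return [(row, index[row[key]]) for row in left if row[key] in index]


def similarity(a, b):
    """String similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_match(left, right, threshold=0.85):
    """Block on exact date of birth, then keep the best name match
    if its similarity reaches the threshold."""
    pairs = []
    for row in left:
        candidates = [r for r in right if r["dob"] == row["dob"]]
        if not candidates:
            continue
        best = max(candidates, key=lambda r: similarity(row["name"], r["name"]))
        score = similarity(row["name"], best["name"])
        if score >= threshold:
            pairs.append((row, best, round(score, 2)))
    return pairs


exact = deterministic_match(survey, admin, "pid")
fuzzy = fuzzy_match(survey, admin)
print(len(exact), "exact matches;", len(fuzzy), "fuzzy matches")
```

Here the third survey record has no counterpart under the exact join because the identifiers differ, but it is recovered by the score-based step. In real projects the threshold trades false matches against missed matches and needs to be checked against a reviewed sample.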

4. Who’s who in the data integration landscape

Administrative Data Research (ADR)

  • Consortium of organisations, including the ONS, devolved governments and academic partners
  • Mission:
    • “link…
    • and open up de-identified data
    • generated from people’s interactions with public services,
    • making it securely available to accredited researchers.”
  • Point of access for new data linkage within the public sector and between the public sector and researchers

UK LLC

  • Trusted Research Environment - TRE

  • Currently enables linkage between longitudinal studies and data from:

    • NHS England
    • Neighbourhood geographies
    • Address geographies
  • Planned

    • NHS Wales
    • DWP
    • HMRC

UKDS

  • Curation and access to survey data
  • Main gateway for secondary survey data analysis
  • Curates some linked data

Data producers

  • Data producers of the main longitudinal studies, i.e.
    • Understanding Society (ISER)
    • Main cohort studies (CLS)
  • Involved in data matching (e.g. consent management)
  • Need to be consulted when data is linked
  • CLOSER: coordination and cross-study harmonisation
  • Government departments and the ONS

Secure Anonymised Information Linkage (SAIL)

  • Wales based, but data from across the UK
  • Trusted Research Environment - TRE
  • Provides access to (mostly) health-related admin data
  • Enables linkage between some of these datasets for accredited researchers

References

Millennium Cohort Study: Linked Education Administrative Datasets (National Pupil Database - KS1-KS5), England, 2003-2021: Secure Access

Next Steps: Linked Administrative Datasets (Student Loans Company Records), 2007 - 2021: Secure Access

Vacancy Survey, 2005-2025: Secure Access

Grant, P. (2024). The Monitored Self. In: The Virtual Hospital. Springer, Cham.

Silber, H., Breuer, J., Beuthner, C., Gummer, T., Keusch, F., Siegers, P., … Weiß, B. (2022). Linking Surveys and Digital Trace Data: Insights From two Studies on Determinants of Data Sharing Behaviour. Journal of the Royal Statistical Society, Series A (Statistics in Society), 185(Suppl. 2), 387-407.

Whiffen, T., Akbari, A., Paget, T., Lowe, S., Lyons, R. (2020). How effective are population health surveys for estimating prevalence of chronic conditions compared to anonymised clinical data? International Journal of Population Data Science (IJPDS), 5(1).

5. Data linkage at UKDS: roles, routes, and researcher options

The UK Data Service and data linkage: core principles

  • UKDS does not create linkages or integrate data
  • Linked data are created by data owners or processors
  • UKDS negotiates access to these data collections and makes them research-ready and safely accessible
  • The type of linkage researchers can undertake depends on:
    • the access level (Open / Safeguarded / Controlled)
    • the presence or absence of identifiers

What researchers can do in UKDS SecureLab

  • Researchers can:
    • Access more granular variables
    • Create derived or contextual linkages, for example:
      • Environmental or pollution deciles based on postcode-derived measures
      • Area-level deprivation or service access indicators
    • Import external datasets subject to depositor approval
  • Key considerations:
    • UKDS SecureLab does not host direct identifiers
    • Researchers cannot create linkage spines or perform identifier-based matching
    • All linkage activity must be explicitly approved as part of the project
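A contextual linkage of the kind listed above attaches an area-level measure to de-identified respondent records through a shared area code, with no personal identifiers involved. The sketch below is illustrative only: the area codes, the pollution index, the variable names and the helper functions are all hypothetical, and deciles are assigned from simple ranks, so ties would be broken arbitrarily. Any such linkage would still need explicit approval as part of the project.

```python
# Hypothetical area-level measure (e.g. a pollution index) keyed by area code
area_pollution = {
    "A01": 12.1, "A02": 30.5, "A03": 8.4, "A04": 22.0, "A05": 15.3,
    "A06": 41.7, "A07": 19.8, "A08": 27.2, "A09": 10.0, "A10": 35.9,
}

# Hypothetical de-identified survey records carrying only an area code
respondents = [
    {"pid": 101, "area": "A03", "wellbeing": 7},
    {"pid": 102, "area": "A06", "wellbeing": 5},
    {"pid": 103, "area": "A07", "wellbeing": 6},
    {"pid": 104, "area": "A99", "wellbeing": 8},  # area missing from the lookup
]


def rank_deciles(measure):
    """Assign each area a decile (1 = lowest values) from its rank."""
    ordered = sorted(measure, key=measure.get)
    n = len(ordered)
    return {area: rank * 10 // n + 1 for rank, area in enumerate(ordered)}


def attach_context(records, lookup, name):
    """Left join: keep every respondent, with None where no area matches."""
    return [{**rec, name: lookup.get(rec["area"])} for rec in records]


deciles = rank_deciles(area_pollution)
linked = attach_context(respondents, deciles, "pollution_decile")
for row in linked:
    print(row)
```

Keeping unmatched respondents (with a missing value) rather than dropping them makes it possible to report how many records could not be linked.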

6. How to access linked data at UKDS